Code Switch Point Detection in Arabic

نویسندگان

  • Heba Elfardy
  • Mohamed Al-Badrashiny
  • Mona T. Diab
چکیده

This paper introduces a dual-mode stochastic system to automatically identify linguistic code switch points in Arabic. The first of these modes determines the most likely word tag (i.e. dialect or modern standard Arabic) by choosing the sequence of Arabic word tags with maximum marginal probability via lattice search and 5-gram probability estimation. When words are out of vocabulary, the system switches to the second mode which uses a dialectal Arabic (DA) and modern standard Arabic (MSA) morphological analyzer. If the OOV word is analyzable using the DA morphological analyzer only, it is tagged as “DA”, if it is analyzable using the “MSA” morphological analyzer only, it is tagged as MSA, otherwise if analyzable using both of them, then it is tagged as “both”. The system yields an Fβ=1 score of 76.9% on the development dataset and 76.5% on the held-out test dataset, both judged against human-annotated Egyptian forum data.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The George Washington University System for the Code-Switching Workshop Shared Task 2016

We describe our work in the EMNLP 2016 second code-switching shared task; a generic language independent framework for linguistic code switch point detection (LCSPD). The system uses characters level 5-grams and word level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We participated in the Modern Standard Arabic (MS...

متن کامل

The Production of Code-Mixed Discourse

We propose a comprehensive theory of codemixed discourse, encompassing equivalencepoint and insertional code-switching, palindromic constructions and lexical borrowing. The starting point is a production model of code-switching accounting for empirical observations about switch-point distribution (the equivalence constraint), well-formedness of monolingual fragments, conservation of constituent...

متن کامل

Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media

Social media allows people interact to express their thoughts or feelings about different subjects. However, some of users may write offensive twits to other via social media which known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...

متن کامل

LILI: A Simple Language Independent Approach for Language Identification

We introduce a generic Language Independent Framework for Linguistic Code Switch Point Detection. The system uses the word length, character level (1, 2, 3, 4, and 5)-grams and word level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We test our proposed framework and compare it to the state-of-theart published syste...

متن کامل

Processing Code-Switching in Algerian Bilinguals: Effects of Language Use and Semantic Expectancy

Using a cross-modal naming paradigm this study investigated the effect of sentence constraint and language use on the expectancy of a language switch during listening comprehension. Sixty-five Algerian bilinguals who habitually code-switch between Algerian Arabic and French (AA-FR) but not between Standard Arabic and French (SA-FR) listened to sentence fragments and named a visually presented F...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013